Now that we have a way of splitting up cells, it makes sense to look at what a cell's class tells us about the class of the cell that is likely to follow it. I want to answer questions such as: given that a cell is an import, what is the probability that the next cell is an expression?
Intuitively, we should expect to find some order here. This could be the basis for further experiments: choosing classes so as to maximize the order we find in these probabilities.
There are two ways of looking at this problem, which differ in how we choose a class for a cell: by the type of the AST node at the top of the cell, or by binning cells according to the size of their AST.
For each of these we look at two quantities: the overall probability of each class, and the conditional probability of each class given the class of the cell that precedes it.
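To make the question concrete, here is a minimal sketch of how such conditional probabilities can be estimated from a flat sequence of class labels. The transition_probabilities helper below is purely illustrative and is not the CondComputer API used in the cells that follow; the 'start' and 'end' markers mirror the ones we insert per notebook later on.
# Illustrative sketch only -- not the CondComputer implementation.
from collections import Counter, defaultdict

def transition_probabilities(labels):
    # Count how often each class is followed by each other class,
    # skipping the boundary between one notebook's 'end' and the next 'start'.
    counts = defaultdict(Counter)
    for cur, nxt in zip(labels, labels[1:]):
        if cur == 'end':
            continue
        counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

# e.g. the probability that an expression cell directly follows an import cell:
example = ['start', 'Import', 'Expr', 'Assign', 'Expr', 'end']
print(transition_probabilities(example)['Import'])   # {'Expr': 1.0}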
In [1]:
# Necessary imports
import os
import time
from nbminer.notebook_miner import NotebookMiner
from nbminer.cells.cells import Cell
from nbminer.features.ast_features import ASTFeatures
from nbminer.stats.summary import Summary
from nbminer.stats.multiple_summary import MultipleSummary
In [2]:
# Loading in the notebooks
people = os.listdir('../testbed/Final')
notebooks = []
for person in people:
    person = os.path.join('../testbed/Final', person)
    if os.path.isdir(person):
        direc = os.listdir(person)
        notebooks.extend([os.path.join(person, filename) for filename in direc if filename.endswith('.ipynb')])
notebook_objs = [NotebookMiner(file) for file in notebooks]
a = ASTFeatures(notebook_objs)
In [3]:
for i, nb in enumerate(a.nb_features):
    a.nb_features[i] = nb.get_new_notebook()
In [4]:
from helper_classes.cond_computer import CondComputer
node_list = []
for i, nb in enumerate(a.nb_features):
    node_list.append('start')
    for cell in nb.get_all_cells():
        t = type(cell.get_feature('ast').body[0])
        node_list.append(t)
    node_list.append('end')
cc = CondComputer(node_list)
In [5]:
arr, arr_names = cc.compute_probabilities(cc.count_totals, .01)
In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['figure.figsize'] = (20, 10)
cc.plot_bar(arr, arr_names, 'Probability per Node type')
In [7]:
cc.plot_conditional_bar(arr, arr_names, 0, 'Probability per Node type')
In [8]:
cc.plot_conditional_bar(arr, arr_names, 1, 'Probability per Node type')
In [9]:
cc.plot_conditional_bar(arr, arr_names, 2, 'Probability per Node type')
In [10]:
cc.plot_conditional_bar(arr, arr_names, 3, 'Probability per Node type')
In [11]:
cc.plot_conditional_bar(arr, arr_names, 4, 'Probability per Node type')
In [12]:
cc.plot_conditional_bar(arr, arr_names, 5, 'Probability per Node type')
Now that we have a good idea of the predictive power of the node type, and of how assigning each cell to a class could work, let's take a look at how a clustering on a feature space works. The clustering here was done by choosing bins, somewhat arbitrarily, such that each bin holds roughly the same number of examples; one way such equal-count bins could be derived automatically is sketched below.
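As a sketch of what "roughly the same number of examples per bin" could look like in code, the following picks upper bin edges at evenly spaced percentiles. It is an illustration under assumed inputs, not the procedure used to pick bin_end below, which was chosen by hand; the sizes list is a stand-in for the ast_sizes values gathered in the next cell.
# Sketch: derive equal-frequency upper bin edges from a list of per-cell AST sizes.
import numpy as np

def equal_frequency_upper_edges(values, n_bins):
    # The upper edge of each bin sits at an evenly spaced percentile, so each
    # bin ends up holding roughly the same number of cells.
    qs = np.linspace(0, 100, n_bins + 1)[1:]
    return [int(np.ceil(e)) for e in np.percentile(values, qs)]

sizes = [2, 3, 3, 5, 6, 8, 8, 10, 11, 14, 16, 21, 25, 40]   # stand-in data
print(equal_frequency_upper_edges(sizes, 7))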
In [13]:
ast_sizes = []
for i, nb in enumerate(a.nb_features):
    nb.set_ast_size()
    for el in nb.get_all_cells():
        ast_sizes.append(el.get_feature('ast_size'))
In [14]:
# Bin upper bounds, chosen so each bin holds roughly the same number of cells
bin_end = [4, 7, 9, 12, 15, 22, 36]
bin_count = {}
for el in bin_end:
    bin_count[el] = 0
# Count how many cells fall below each upper bound; cells with bin_end[-1]
# or more nodes fall outside every bin and are not counted
for num in ast_sizes:
    for i in range(len(bin_end)):
        if num < bin_end[i]:
            bin_count[bin_end[i]] += 1
            break
# Human-readable labels, one per bin defined above
names = ['Less than ' + str(bin_end[0])]
for i in range(1, len(bin_end)):
    names.append(str(bin_end[i-1]) + ' <= Num Nodes < ' + str(bin_end[i]))
In [15]:
for key in bin_count.keys():
    print(key, bin_count[key])
In [16]:
size_features = []
for i, nb in enumerate(a.nb_features):
    nb.set_ast_size()
    size_features.append('start')
    for el in nb.get_all_cells():
        num = el.get_feature('ast_size')
        for ind in range(len(bin_end)):
            if num < bin_end[ind]:
                size_features.append(ind)
                break
    size_features.append('end')
In [17]:
cc = CondComputer(size_features)
In [18]:
arr, arr_names = cc.compute_probabilities(cc.count_totals, 0, np.arange(7))
In [19]:
cc.plot_bar(arr, names, 'Probability per Node size')
In [20]:
cc.plot_conditional_bar(arr, arr_names, 0, 'Probability per Node size', x_labels = names)
In [21]:
cc.plot_conditional_bar(arr, arr_names, 1, 'Probability per Node size', x_labels = names)
In [22]:
cc.plot_conditional_bar(arr, arr_names, 2, 'Probability per Node size', x_labels = names)
In [23]:
cc.plot_conditional_bar(arr, arr_names, 3, 'Probability per Node size', x_labels = names)
In [24]:
cc.plot_conditional_bar(arr, arr_names, 4, 'Probability per Node size', x_labels = names)
In [25]:
cc.plot_conditional_bar(arr, arr_names, 5, 'Probability per Node size', x_labels = names)
In [26]:
cc.plot_conditional_bar(arr, arr_names, 6, 'Probability per Node size', x_labels = names)
We find that there is definite predictive power in both methods of defining classes, and the results generally correspond with our preconceived notions of how likely a certain cell type is to appear after another. There were some interesting correlations in the cell-size experiments, and we are interested both in looking into these correlations further and in generating new features to test this with.